Boosting

Christian Duffau-Rasmussen

03-10-2021

Boosting - What is it ?

A procedure to combine many weak learners to produce a powerful committee.

(J. Friedman et al. 2009, sec. 10.1)

A brief history

(Valiant 1984)

  • Probably Approximately Correct Learner (PAC)
  • A formal theory of learnability
  • Proof that broad classes of non-trivial boolean functions are learnable

Side note: PAC learning

  • A concept is Probably Approximately Correct (PAC) learnable
    • if in time polynomial in \(s\), \(1/\varepsilon\) and \(1/\delta\) an algorithm can deduce a hypothesis
    • with an error-rate \(<\varepsilon\) with probability at least \(1-\delta\)
    • for all \(\varepsilon>0\) and \(0 < \delta \leq 1\)
    • where \(s\) is the size of the concept in bits, encoded appropriately
  • The learning protocol is:
    • learn from examples asking an ORACLE
    • the ORACLE returns a random example in unit time (O(1))

A brief history

Leslie Valiant, Michael Kearns

(Kearns and Valiant 1989)

  • Introduce weak learner
  • Performs only slightly better than chance
  • The hypothesis boosting problem: weak <=> strong ?

Side note: weak and strong learners

  • strongly learnable == PAC learnable

  • A concept is weakly learnable

    • if in polynomial time an algorithm can deduce a hypothesis
    • with an error-rate \(<\frac{1}{2} - \frac{1}{p(s)}\) for some polynomial \(p\), with probability at least \(1-\delta\)
    • for all \(0 < \delta \leq 1\)
    • where \(s\) is the size of the concept in bits, encoded appropriately

A brief history

(R. E. Schapire 1990)

  • Cracks the Hypothesis boosting problem
  • Shows weak learner <=> strong learner
  • An algorithm constructing a strong learner from weak ones 🤯

A brief history

(Freund 1990)

  • Implements a much more efficient boosting algorithm
  • Trains learners on weighted subsets of the data
  • Uses majority voting to predict

A brief history

(R. Schapire and Freund 1995)

  • Introduces AdaBoost
  • First practical boosting algorithm
  • Has been very effective

A brief history

(J. H. Friedman 2001) (Mason et al. 1999)

  • Generalizes the boosting concept
  • Describes boosting as gradient descent in function space

A brief history

(Chen and He 2015)

  • Places at the top of the Kaggle Higgs Boson challenge using XGBoost
  • XGBoost quickly becomes one of the most winning algorithms on Kaggle

Ensemble methods

  • Bagging: Grow trees using random subsets of data (with replacement)
  • Random forest: Grow trees using random subset of features
  • Boosting: Grow trees sequentially on a re-weighted dataset

\[\text{Boosting} \succ \text{Random forest} \succ \text{Bagging} \succ \text{Tree}\]

Simulation example

\[Y = \begin{cases} 1 & \text{if}\quad X_1^2 + \ldots + X_{10}^2 > 9.34 \\ -1 & \text{else} \end{cases}\quad X_i\sim N(0,1)\]

Simulated 10-D nested spheres
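The simulated data are straightforward to generate; a minimal sketch in plain Python (the function name and seed are ours):

```python
import random

def make_nested_spheres(n, d=10, threshold=9.34, seed=0):
    """Draw X_i ~ N(0,1) in d dimensions; label +1 if the squared radius exceeds the threshold."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in range(d)]
        X.append(x)
        y.append(1 if sum(v * v for v in x) > threshold else -1)
    return X, y

X, y = make_nested_spheres(2000)
# 9.34 is the median of a chi-squared variable with 10 degrees of freedom,
# so the two classes come out approximately balanced
```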

Simulation example

Bias/variance trade-off

\[\text{MSE} = \text{Bias}(\hat{f})^2 + \text{Var}(\hat{f}) + \text{Irreducible noise}\]

Variance of averages

\[\text{Var}\left(\frac{1}{n}\sum_i^n X_i\right) = \frac{1}{n^2}\sum_i^n \text{Var}\left(X_i\right) + \frac{1}{n^2}\sum_{i\neq j} \text{Cov}(X_i, X_j)\]

\[\text{Variance of ensemble} = \frac{\text{Var(Trees)}}{n} + \frac{n-1}{n}\text{Cov(Trees)}\]
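A quick Monte Carlo check of the ensemble-variance formula (a sketch; the chosen variance, covariance and seed are ours): build \(n\) variables with common variance \(\sigma^2\) and pairwise covariance \(c\) by sharing a component, and compare the empirical variance of their average with \(\sigma^2/n + \frac{n-1}{n}c\).

```python
import math
import random

rng = random.Random(1)
n, trials = 10, 20000
sigma2, cov = 1.0, 0.5                  # Var(X_i) and Cov(X_i, X_j), i != j

samples = []
for _ in range(trials):
    z = rng.gauss(0.0, math.sqrt(cov))  # shared component produces the covariance
    xs = [z + rng.gauss(0.0, math.sqrt(sigma2 - cov)) for _ in range(n)]
    samples.append(sum(xs) / n)

mean = sum(samples) / trials
var_avg = sum((s - mean) ** 2 for s in samples) / trials
theory = sigma2 / n + (n - 1) / n * cov  # = 0.55 for these values
```

Highly correlated trees keep the second term large, which is why bagging alone has limited effect and random forests de-correlate the trees.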

Variance and bias reduction

Classifying 10-D nested spheres

Boosting

Forward Stagewise Additive Learning

  1. Initialize \(f_0(x) = 0\)
  2. For \(m=1\) to \(M\):
    1. Compute \[(\beta_m, \gamma_m) = \underset{\beta, \gamma}{\text{argmin}} \sum_{i=1}^N L(y_i, f_{m-1}(x_i)+ \beta b(x_i;\gamma))\]
    2. Set \(f_m(x) = f_{m-1}(x) + \beta_m b(x; \gamma_m)\)

Adaboost

  1. Initialize weights \(w_i=1/N\)
  2. For \(m=1\) to \(M\):
    1. Fit classifier \(G_m(x)\) to training data using \(w_i\)’s
    2. Compute \[\text{err} = \frac{\sum_{i=1}^N w_i I(y_i \neq G_m(x_i))}{\sum_{i=1}^N w_i}\]
    3. Compute \(\alpha_m = \log((1-\text{err})/\text{err})\)
    4. Set \(w_i \leftarrow w_i \exp(\alpha_m I(y_i \neq G_m(x_i)))\)
  3. Output \[G(x) = \text{sign}\left(\sum_{m=1}^M \alpha_m G_m(x)\right)\]
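The algorithm above is short enough to implement directly; a sketch in plain Python with decision stumps as the weak learners (helper names are ours):

```python
import math

def fit_stump(X, y, w):
    """Return the decision stump (feature j, threshold t, sign) with minimal weighted error."""
    best = None
    for j in range(len(X[0])):
        for t in sorted(set(x[j] for x in X)):
            for sign in (1, -1):
                err = sum(wi for wi, x, yi in zip(w, X, y)
                          if (sign if x[j] > t else -sign) != yi)
                if best is None or err < best[0]:
                    best = (err, j, t, sign)
    return best

def adaboost(X, y, M=20):
    N = len(X)
    w = [1.0 / N] * N                            # step 1: uniform weights
    models = []
    for _ in range(M):
        err, j, t, sign = fit_stump(X, y, w)     # step 2.1
        err = max(err / sum(w), 1e-12)           # step 2.2: weighted error rate
        alpha = math.log((1 - err) / err)        # step 2.3
        models.append((alpha, j, t, sign))
        w = [wi * math.exp(alpha) if (sign if x[j] > t else -sign) != yi else wi
             for wi, x, yi in zip(w, X, y)]      # step 2.4: up-weight mistakes
    return models

def predict(models, x):
    score = sum(a * (s if x[j] > t else -s) for a, j, t, s in models)
    return 1 if score >= 0 else -1
```

On a toy interval problem (y = +1 only inside an interval), boosted stumps drive the training error down quickly even though no single stump can represent the concept.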

Adaboost

  • Not originally motivated in Forward Stagewise Additive Learning
  • One can show that Adaboost is a stagewise additive model with loss \[L(y, f (x)) = \exp(−y f(x))\] where \(y\in \{-1, 1\}\) and \(f(x) \in \mathbb{R}\).
  • Usually for classification we use cross-entropy \[L(y, f(x)) = -\left[y^\prime \log p(x) + (1-y^\prime) \log(1- p(x))\right] \] where \(y^\prime = (y+1)/2\in \{0,1\}\) and \(p(x)\) is the sigmoid (logistic) transform of \(f(x)\).
  • In the population limit the two loss functions are equivalent
  • Both are minimized by the same \(f(x) = \frac{1}{2}\log\frac{P(y=1\mid x)}{P(y=-1\mid x)}\)

Adaboost

  • For finite samples the exponential loss has drawbacks
  • Too much weight is given to misclassified points
Exponential loss and cross-entropy
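The difference is easy to see numerically. As functions of the margin \(yf(x)\), the exponential loss is \(e^{-yf}\) while the binomial deviance grows only linearly for large negative margins (a sketch, writing the deviance in margin form):

```python
import math

def exponential_loss(margin):
    """AdaBoost's loss as a function of the margin y*f(x)."""
    return math.exp(-margin)

def deviance(margin):
    """Binomial deviance (cross-entropy) in margin form for y in {-1, 1}."""
    return math.log(1.0 + math.exp(-2.0 * margin))

for m in (-3, -1, 0, 1, 3):
    print(m, exponential_loss(m), deviance(m))
# at margin -3 the exponential loss is ~20.1 while the deviance is ~6.0:
# badly misclassified points dominate the exponential criterion
```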

Adaboost

  • 10D nested spheres with noisy data
Clean data
Noisy data

Adaboost

  • Adaboost has trouble on noisy data
Test error clean data
Test error noisy data

Adaboost

  • How to fix Adaboost?
  • Can we easily plug in a more robust loss function?
  1. Initialize weights \(w_i=1/N\)
  2. For \(m=1\) to \(M\):
    1. Fit classifier \(G_m(x)\) to training data using \(w_i\)’s
    2. Compute \(\text{err} = \frac{\sum_{i=1}^N w_i I(y_i \neq G_m(x_i))}{\sum_{i=1}^N w_i}\)
    3. Compute \(\alpha_m = \log((1-\text{err})/\text{err})\)
    4. Set \(w_i \leftarrow w_i \exp(\alpha_m I(y_i \neq G_m(x_i)))\)
  3. Output \[G(x) = \text{sign}\left(\sum_{m=1}^M \alpha_m G_m(x)\right)\]

🤷‍♂️

Gradient boosting

  • Introduced by (J. H. Friedman 2001) as a new view of boosting
  • Makes it possible to derive a boosting procedure
  • Solves each forward stagewise step as gradient descent in a function space
  • Applies to any differentiable loss function

Side note: Gradient boosting in SciKit Learn

Scikit-learn documentation:

GB builds an additive model in a forward stage-wise fashion; it allows for the optimization of arbitrary differentiable loss functions.

Sound promising…

loss: {‘deviance’, ‘exponential’}

For loss ‘exponential’ gradient boosting recovers the AdaBoost algorithm.

🤦‍♂️

Gradient boosting

  • Consider the loss function, \[\sum_{i=1}^N L\left(y_i, \sum_{m=1}^M \beta_m b(x_i, \gamma_m)\right)\]
  • … to be optimized over \(\{\beta_m, \gamma_m\}_{m=1}^M\)
  • … which might be intractable

Gradient boosting

  • We try to solve it in a forward stagewise manner
  • For \(m=1\) to \(M\):
    • Optimize \[\sum_{i=1}^N L\left(y_i, f_{m-1}(x_i) + \beta_m b(x_i; \gamma_m)\right)\]
  • where \(f_m(x) = f_{m-1}(x) + \beta_m b(x; \gamma_m)\).
  • If the basis functions are trees
  • … the stagewise minimization can be difficult

Gradient boosting

  • Consider the expected loss function \[\phi(f) = E[L(y,f(\boldsymbol{x}))]\]
  • we seek \(f\) that minimizes \(\phi\)
  • That’s a difficult variational analysis problem
  • Let’s circumvent the problem…
  • … consider a dataset of \(N\) observations instead \[\mathbf{f} = \{f(x_1), \ldots, f(x_N)\}\]

Gradient boosting

  • Let’s minimize the empirical loss \[L(f) = \sum_{i=1}^N L(y_i, f(x_i))\] wrt. \(f(x_i)\) numerically.
  • Numerical optimization proceeds as a series of \(M\) steps towards the optimum
  • Let each step be a vector \(\mathbf{h}_m\in\mathbb{R}^N\)
  • The numerical optimum is \[\mathbf{f}_M = \sum_{m=1}^M \mathbf{h}_m\]

Gradient boosting

  • In gradient descent each step \(\mathbf{h}_m\) is \[-\rho \mathbf{g}_m\]
  • … where \(g_{im}\) is \[\left[\frac{\partial L(y_i, f(x_i))}{\partial f(x_i)}\right]_{f(x)=f_{m-1}(x)}\]
  • The numerical optimization is then \[ \begin{align*} \mathbf{f}_m &= \mathbf{f}_{m-1} + \mathbf{h}_m\\ &= \mathbf{f}_{m-1} -\rho \mathbf{g}_m \end{align*} \]

Gradient boosting

  • Let’s add line search, \[\mathbf{f}_m = \mathbf{f}_{m-1} - \rho_m \mathbf{g}_m\] where \(\rho_m = \underset{\rho}{\text{argmin}}\quad L(\mathbf{f}_{m-1}-\rho \mathbf{g}_m )\)
  • This fits the pattern of forward stagewise learning \[f_m(x) = f_{m-1}(x) + \beta_m b(x; \gamma_m)\]
  • where \(\rho_m\) is analogous to \(\beta_m\)
  • and \(-\mathbf{g}_m\) is analogous to \(b(x; \gamma_m)\)
  • We can derive the gradient considering \(f(x)\) a variable
  • … and easily compute \(\mathbf{g}_m(f(x))\) using the value from the previous iteration \(f_{m-1}(x)\)
  • … we cannot compute \(\mathbf{g}_m\) outside the dataset 🤔

Gradient boosting

  • The idea of (J. H. Friedman 2001):
    • Fit a learner to \(-\mathbf{g}_m\) \(\rightarrow \hat{b}_m(x; \gamma_m)\)
    • Forward stagewise step: \[f_m(x) = f_{m-1}(x) + \rho_m \hat{b}_m(x; \gamma_m)\]
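For squared-error loss the negative gradient is simply the residual \(y_i - f_{m-1}(x_i)\), which makes the idea concrete; a sketch in plain Python with regression stumps as the base learners (helper names and the learning rate are ours):

```python
def fit_stump_reg(X, r):
    """Least-squares regression stump fitted to the residuals (the negative gradient)."""
    best = None
    for j in range(len(X[0])):
        for t in set(x[j] for x in X):
            left = [ri for x, ri in zip(X, r) if x[j] <= t]
            right = [ri for x, ri in zip(X, r) if x[j] > t]
            if not left or not right:
                continue
            cl, cr = sum(left) / len(left), sum(right) / len(right)
            sse = (sum((ri - cl) ** 2 for ri in left)
                   + sum((ri - cr) ** 2 for ri in right))
            if best is None or sse < best[0]:
                best = (sse, j, t, cl, cr)
    return best[1:]

def gradient_boost(X, y, M=50, lr=0.1):
    """Squared-error gradient boosting: each stage fits a stump to the residuals."""
    f = [0.0] * len(X)
    models = []
    for _ in range(M):
        r = [yi - fi for yi, fi in zip(y, f)]   # -g_m for L = (y - f)^2 / 2
        j, t, cl, cr = fit_stump_reg(X, r)
        models.append((j, t, lr * cl, lr * cr))  # shrink the step by the learning rate
        f = [fi + (lr * cl if x[j] <= t else lr * cr) for fi, x in zip(f, X)]
    return models

def predict_gb(models, x):
    return sum(cl if x[j] <= t else cr for j, t, cl, cr in models)
```

For squared loss the leaf means already solve the line search exactly; the learning rate acts as shrinkage, a regularizer Friedman recommends in practice.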

XGBoost

  • Explicit regularization in loss function
  • Newton boosting:
    • \(\mathbf{h}_m\) is \[-\rho_m \mathbf{H}_m^{-1}\mathbf{g}_m\]
  • Custom split point identification: “Weighted Quantile Sketch”
  • Computationally optimized
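For the binary logistic loss the per-point gradient and Hessian are \(g_i = p_i - y_i\) and \(h_i = p_i(1-p_i)\), and the regularized Newton step gives each leaf the weight \(w^* = -\sum_i g_i \,/\, (\sum_i h_i + \lambda)\). A minimal sketch of that leaf-weight computation (function and variable names are ours):

```python
import math

def grad_hess(y, f):
    """Gradient and Hessian of the logistic loss at raw score f, for a label y in {0, 1}."""
    p = 1.0 / (1.0 + math.exp(-f))
    return p - y, p * (1.0 - p)

def leaf_weight(points, lam=1.0):
    """Regularized Newton step for one leaf: w* = -sum(g) / (sum(h) + lambda)."""
    gs, hs = zip(*(grad_hess(y, f) for y, f in points))
    return -sum(gs) / (sum(hs) + lam)

# a leaf holding two positives and one negative, all with current score f = 0
w = leaf_weight([(1, 0.0), (1, 0.0), (0, 0.0)])
```

The \(\lambda\) in the denominator is the explicit regularization mentioned above: it shrinks leaf weights toward zero, most strongly for leaves with little curvature evidence.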

References

Chen, Tianqi, and Tong He. 2015. “Higgs Boson Discovery with Boosted Trees.” In NIPS 2014 Workshop on High-Energy Physics and Machine Learning, 69–80. PMLR.
Freund, Yoav. 1990. “Boosting a Weak Learning Algorithm by Majority.” In Proceedings of the Third Annual Workshop on Computational Learning Theory, 1990.
Friedman, Jerome H. 2001. “Greedy Function Approximation: A Gradient Boosting Machine.” Annals of Statistics, 1189–1232.
Friedman, Jerome, Trevor Hastie, Robert Tibshirani, et al. 2009. The Elements of Statistical Learning. 2nd ed. New York, NY: Springer.
Kearns, M, and LG Valiant. 1989. “Cryptographic Limitations on Learning Boolean Formulae and Finite Automata.” In Proceedings of the Twenty-First Annual ACM Symposium on Theory of Computing, 433–44.
Mason, Llew, Jonathan Baxter, Peter Bartlett, and Marcus Frean. 1999. “Boosting Algorithms as Gradient Descent in Function Space.” In Proc. NIPS, 12:512–18.
Schapire, R, and Y Freund. 1995. “A Decision-Theoretic Generalization of on-Line Learning and an Application to Boosting.” In Second European Conference on Computational Learning Theory, 23–37.
Schapire, Robert E. 1990. “The Strength of Weak Learnability.” Machine Learning 5 (2): 197–227.
Valiant, Leslie G. 1984. “A Theory of the Learnable.” Communications of the ACM 27 (11): 1134–42.